Coordinating Access to Computation and Data in Distributed Systems

Author

  • Douglas L. Thain
Abstract

Distributed computing has become a complex ecosystem of protocols and services for managing computation and data. Distributed applications are becoming complex as well. Users, particularly in scientific fields, wish to deploy large numbers of applications with complex dependencies and a large appetite for both computation and data. How may such systems and applications be brought together? I propose that applications deployed in distributed systems should be represented by an agent. The role of the agent is to transform an application’s abstract operations into concrete operations on the varying resources in a distributed system. The agent must hide the unpleasant aspects of individual resources while coordinating their activity in a manner specialized to each application. I examine four open problems in the design of agents for distributed computing. First, I explore a variety of techniques for coupling a job to an agent, informed by the experience of porting to different systems and deploying with several applications. Coupling an agent to a job via the debugger is by far the most reliable and usable technique and has acceptable overhead for scientific applications. Second, I describe the problem of coupling an agent to a variety of distributed data systems. This is difficult because of the subtle semantic differences between existing data interfaces. These differences result in the notion of an escaping error, which represents a runtime incompatibility between interfaces. Third, I present the problem of coupling an agent to a computation manager such as a batch system. This requires a careful discussion of errors in a distributed system. I develop a theory of error propagation and present the notion of error scope, which is needed to guide the propagation of escaping errors. Finally, I explain how an agent may coordinate the consumption of computation and data resources on behalf of a job.
As a case study, I present BAD-FS, a system that executes data-intensive batch workloads on faulty distributed systems. I conclude with quantitative evidence underscoring the importance of failure handling in distributed systems.
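
The ideas of an escaping error and an error scope can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the design from the thesis or from BAD-FS: every name in it (Scope, BackendError, EscapingError, FlakyStore, Agent) is hypothetical, and the three scope levels are chosen for the example. It shows an agent turning an abstract read into a concrete backend call, reporting errors that the job's interface can express, and letting everything else escape to a wider scope.

```python
from enum import Enum


class Scope(Enum):
    """Hypothetical error scopes, from narrowest to widest (an assumption)."""
    FILE = 1      # the error concerns only the requested file
    JOB = 2       # the error invalidates this run of the job
    RESOURCE = 3  # the error invalidates the storage resource itself


class BackendError(Exception):
    """Error raised by a concrete storage backend (illustrative only)."""

    def __init__(self, message, scope):
        super().__init__(message)
        self.scope = scope


class EscapingError(Exception):
    """An error the job's interface cannot express; handled above the job."""

    def __init__(self, cause):
        super().__init__(str(cause))
        self.scope = cause.scope


class FlakyStore:
    """Toy backend: holds a dict of files and may be offline."""

    def __init__(self, files, online=True):
        self.files = files
        self.online = online

    def fetch(self, path):
        if not self.online:
            raise BackendError("connection lost", Scope.RESOURCE)
        if path not in self.files:
            raise BackendError("no such file", Scope.FILE)
        return self.files[path]


class Agent:
    """Maps the job's abstract read() onto one concrete backend."""

    def __init__(self, store):
        self.store = store

    def read(self, path):
        try:
            return self.store.fetch(path)
        except BackendError as err:
            if err.scope is Scope.FILE:
                # Expressible in the job's file interface: report it directly.
                raise FileNotFoundError(path) from err
            # Not expressible as a file error: let it escape with its scope.
            raise EscapingError(err) from err
```

In this sketch a missing file reaches the job as an ordinary FileNotFoundError, while a lost connection surfaces as an EscapingError carrying Scope.RESOURCE, so whatever manages the job (for example a batch system) could decide to retry elsewhere instead of letting the job misread the failure as a file error.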

Similar resources

A New Job Scheduling in Data Grid Environment Based on Data and Computational Resource Availability

Data Grid is an infrastructure that controls huge amounts of data files and provides intensive computational resources across geographically distributed collaborations. The heterogeneity and geographic dispersion of grid resources and applications pose complex problems such as job scheduling. Most existing scheduling algorithms in Grids only focus on one kind of Grid jobs which can be data...

Access control in ultra-large-scale systems using a data-centric middleware

The primary characteristic of an Ultra-Large-Scale (ULS) system is ultra-large size on any related dimension. A ULS system is generally considered a system-of-systems with heterogeneous nodes and autonomous domains. As the size of a system-of-systems grows and the demand for interoperability between sub-systems increases, achieving a more scalable and dynamic access control system becomes an im...

A High Performance Parallel IP Lookup Technique Using Distributed Memory Organization and ISCB-Tree Data Structure

The IP Lookup Process is a key bottleneck in routing due to the increase in routing table size, increasing traffic and migration to IPv6 addresses. The IP address lookup involves computation of the Longest Prefix Matching (LPM), for which existing solutions such as BSD Radix Tries scale poorly when traffic in the router increases or when employed for IPv6 address lookups. In this paper, we describe a ...

E2DR: Energy Efficient Data Replication in Data Grid

Data grids are an important branch of grid computing and provide mechanisms for the management of large volumes of distributed data. Energy efficiency has recently emerged as a hot topic in large distributed systems. The development of computing systems is traditionally focused on performance improvements driven by the demand of client applications in scientific and business domai...

Publication date: 2004